fix: real liveness probe on /health (Bug #8)#54
Open
avrabe wants to merge 1 commit into
Open
Conversation
Why: DevOps/SRE flagged that /health returned 200 unconditionally.
PM2's restart-on-failed-healthcheck and any external uptime monitor
were both blind to a hung scheduler, a locked SQLite DB, or a full
data disk. The bot could be silently dead for hours.
What:
- New module src/health.js: pure-function evaluateHealth(probes)
with three checks (scheduler tick freshness, SQLite ping, disk
free) and three states (healthy / degraded / unhealthy).
- Scheduler exposes getLastTickAt() and getIntervalMs(); tick
timestamp is updated in finally so even error paths refresh it.
/health flags the scheduler as 'fail' when the last tick is older
than 2× the interval.
- persistent-kv exposes ping() — a SELECT 1 against the open
Database. Throws on locked / broken file.
- app.js wires both /health endpoints (Express router + Probot v14
addHandler) through evaluateHealth and returns 503 when
probe.ok=false. The body always includes a `checks` map so PM2
logs / dashboards can see *which* check tripped.
The three states:
healthy: every check passed → 200
degraded: a probe missing, threw, or scheduler had no
tick yet — not actionable but visible → 200
unhealthy: a check failed in a way that means the bot
cannot do its job (DB ping throws, scheduler
hung, disk full) → 503
Test plan:
- 12 new unit tests in __tests__/unit/health.test.js covering each
probe's pass/fail/error/missing branches plus the most-severe-wins
aggregation rule.
- Full suite: 846 pass (was 834), lint clean.
- Existing /health integration tests still pass — when probes are
not injected (test setup leaves them null) evaluateHealth returns
healthy so the 200 path is unchanged.
Risk: medium. The behaviour change is intentional — /health can now
return 503. Any external monitor that currently treats temper as
always-healthy should be re-checked, but the whole point is that they
should now restart on real outages.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
What's checked
Why three states
`degraded` exists for "the probe itself broke" cases (e.g. `statfs` throws on a missing path). Returning 200 keeps PM2 from restarting (which wouldn't fix the underlying probe issue) while still flagging it visibly in the response body.
Test plan
Risk
Medium — this is an intentional behaviour change. `/health` can now return non-200. External monitors that previously treated temper as always-healthy should be re-checked, but the whole point is that they should restart on real outages.
🤖 Generated with Claude Code